Statistical Modelling of COVID-19 Outbreak in Italy

09 May 2020



Nonlinear growth models

Nonlinear growth models represent an instance of nonlinear regression models, a class of models taking the general form \[ y = \mu(x, \theta) + \epsilon, \] where \(\mu(x, \theta)\) is the mean function which depends on a possibly vector-valued parameter \(\theta\), and a possibly vector-valued predictor \(x\). The stochastic component \(\epsilon\) represents the error with mean zero and constant variance. Usually, a Gaussian distribution is also assumed for the error term.

By defining the mean function \(\mu(x, \theta)\) we may obtain several different models, all characterized by the fact that parameters \(\theta\) enter in a nonlinear way into the equation. Parameters are usually estimated by nonlinear least squares which aims at minimizing the residual sum of squares.

Exponential

\[ \mu(x) = \theta_1 \exp\{\theta_2 x\} \] where \(\theta_1\) is the value at the origin (i.e. \(\mu(x=0)\)), and \(\theta_2\) represents the (constant) relative ratio of change (i.e. \(\frac{d\mu(x)}{dx }\frac{1}{\mu(x)} = \theta_2\)). Thus, the model describes an increasing (exponential growth if \(\theta_2 > 0\)) or decreasing (exponential decay if \(\theta_2 < 0\)) trend with constant relative rate.

Logistic

\[ \mu(x) = \frac{\theta_1}{1+\exp\{(\theta_2 - x)/\theta_3\}} \] where \(\theta_1\) is the upper horizontal asymptote, \(\theta_2\) represents the x-value at the inflection point of the symmetric growth curve, and \(\theta_3\) represents a scale parameter (and \(1/\theta_3\) is the growth-rate parameter that controls how quickly the curve approaches the upper asymptote).

Gompertz

\[ \mu(x) = \theta_1 \exp\{-\theta_2 \theta_3^x\} \] where \(\theta_1\) is the horizontal asymptote, \(\theta_2\) represents the value of the function at \(x = 0\) (displacement along the x-axis), and \(\theta_3\) represents a scale parameter.

The difference between the logistic and Gompertz functions is that the latter is not symmetric around the inflection point.

Richards

\[ \mu(x) = \theta_1 (1 - \exp\{-\theta_2 x\})^{\theta_3} \] where \(\theta_1\) is the horizontal asymptote, \(\theta_2\) represents the rate of growth, and \(\theta_3\) in part determines the point of inflection on the y-axis.

Data

Dipartimento della Protezione Civile: COVID-19 Italia - Monitoraggio della situazione http://arcg.is/C1unv

Source: https://github.com/pcm-dpc/COVID-19

## # Dati COVID-19 Italia
## 
## ## Avvisi
## 
## ```diff
## - 08/05/2020: dati Regione Basilicata ricalcolo casi positivi (diminuzione)
## - 07/05/2020: dati Regione Basilicata ricalcolo casi positivi (diminuzione)
## - 06/05/2020: dati Regione Lombardia aggiornamento dimessi guariti (aumento)
## - 04/05/2020: dati Regione Sardegna ricalcolo nuovi casi e guariti
## - 02/05/2020: dati Regione Lombardia ricalcolati 329 decessi (47 di oggi e 282 da riconteggio di aprile)
## - 01/05/2020: dati Regione Lazio ricalcolati 41 decessi (8 nelle ultime 48 ore e 33 ad aprile)
## - 26/04/2020: dati Regione Valle d'Aosta ricalcolati (casi testati)
## - 24/04/2020: dati Regione Sardegna ricalcolati (1.237 tamponi aggiunti)
## - 24/04/2020: dati Regione Friuli Venezia Giulia in fase di revisione su dimessi/guariti
## - 23/04/2020: dati Regione Lazio parziali (casi testati non completi)
## - 23/04/2020: dati Regione Campania parziali (casi testati non aggiornati)
## - 21/04/2020: dati Regione Lombardia parziali (casi testati non aggiornati)
## - 20/04/2020: dati Regione Lombardia ricalcolati (ricalcolo di casi testati - eliminazione duplicati)
## - 15/04/2020: dati Regione Friuli Venezia Giulia ricalcolati (ricalcolo di isolamento domiciliare e dimessi/guariti)
## - 12/04/2020: dati P.A. Bolzano ricalcolati (ricalcolo dati guariti -110 rispetto a ieri)
## - 10/04/2020: dati Regione Molise parziali (dato tamponi non aggiornato)
## - 29/03/2020: dati Regione Emilia-Romagna parziali (dato tamponi non aggiornato)
## - 26/03/2020: dati Regione Piemonte parziali (-50 deceduti - comunicazione tardiva)
## - 18/03/2020: dati Regione Campania non pervenuti
## - 18/03/2020: dati Provincia di Parma non pervenuti
## - 17/03/2020: dati Provincia di Rimini non aggiornati
## - 16/03/2020: dati P.A. Trento e Puglia non pervenuti
## - 11/03/2020: dati Regione Abruzzo non pervenuti
## - 10/03/2020: dati Regione Lombardia parziali
## - 07/03/2020: dati Brescia +300 esiti positivi
## ```
url = "https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-andamento-nazionale/dpc-covid19-ita-andamento-nazionale.csv"
COVID19 <- read.csv(file = url, stringsAsFactors = FALSE)
COVID19$data <- as.Date(COVID19$data)
# DT::datatable(COVID19)


Modelling total infected

# create data for analysis
data = data.frame(date = COVID19$data,
                  y = COVID19$totale_casi,
                                    dy = reldiff(COVID19$totale_casi))
data$x = as.numeric(data$date) - min(as.numeric(data$date)) + 1
DT::datatable(data, options = list("pageLength" = 5))

Estimation

Exponential

mod1_start = lm(log(y) ~ x, data = data)
b = unname(coef(mod1_start))
start = list(th1 = exp(b[1]), th2 = b[2])
mod1 = nls(y ~ exponential(x, th1, th2), data = data, start = start)
summary(mod1)
## 
## Formula: y ~ exponential(x, th1, th2)
## 
## Parameters:
##         Estimate   Std. Error t value Pr(>|t|)    
## th1 29791.950195  3045.896480   9.781 5.65e-15 ***
## th2     0.028945     0.001649  17.556  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27900 on 74 degrees of freedom
## 
## Number of iterations to convergence: 12 
## Achieved convergence tolerance: 0.000005847

Logistic

mod2 = nls(y ~ SSlogis(x, Asym, xmid, scal), data = data)
summary(mod2)
## 
## Formula: y ~ SSlogis(x, Asym, xmid, scal)
## 
## Parameters:
##         Estimate  Std. Error t value Pr(>|t|)    
## Asym 214074.8298   2087.1683  102.57   <2e-16 ***
## xmid     38.3957      0.3400  112.94   <2e-16 ***
## scal      9.5602      0.2606   36.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5836 on 73 degrees of freedom
## 
## Number of iterations to convergence: 0 
## Achieved convergence tolerance: 0.0000008347

Gompertz

mod3 = nls(y ~ SSgompertz(x, Asym, b2, b3), data = data)
# start = list(Asym = coef(mod2)[1])
# tmp = list(y = log(log(start$Asym) - log(data$y)), x = data$x)
# b = unname(coef(lm(y ~ x, data = tmp)))
# start = c(start, c(b2 = exp(b[1]), b3 = exp(b[2])))
# mod3 = nls(y ~ SSgompertz(x, Asym, b2, b3), data = data, start = start,
#            control = nls.control(maxiter = 1000))
summary(mod3)
## 
## Formula: y ~ SSgompertz(x, Asym, b2, b3)
## 
## Parameters:
##            Estimate     Std. Error t value Pr(>|t|)    
## Asym 233259.2927217   1187.7082152  196.39   <2e-16 ***
## b2        7.7628649      0.1533278   50.63   <2e-16 ***
## b3        0.9408388      0.0006671 1410.40   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1963 on 73 degrees of freedom
## 
## Number of iterations to convergence: 0 
## Achieved convergence tolerance: 0.000003873

Richards

richards <- function(x, th1, th2, th3) th1*(1 - exp(-th2*x))^th3
Loss  <- function(th, y, x) sum((y - richards(x, th[1], th[2], th[3]))^2) 
start <- optim(par = c(coef(mod2)[1], 0.001, 1), fn = Loss, 
               y = data$y, x = data$x)$par
names(start) <- c("th1", "th2", "th3")
mod4 = nls(y ~ richards(x, th1, th2, th3), data = data, start = start,
           # trace = TRUE, algorithm = "plinear", 
           control = nls.control(maxiter = 1000, tol = 0.1))
# algorithm is not converging... 
summary(mod4)
## 
## Formula: y ~ richards(x, th1, th2, th3)
## 
## Parameters:
##           Estimate     Std. Error t value Pr(>|t|)    
## th1 239939.4325467   1164.6166084  206.02   <2e-16 ***
## th2      0.0532886      0.0006493   82.07   <2e-16 ***
## th3      5.5580824      0.1104457   50.32   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1474 on 73 degrees of freedom
## 
## Number of iterations to convergence: 1 
## Achieved convergence tolerance: 0.07423
# library(nlmrt)
# mod4 = nlxb(y ~ th1*(1 - exp(-th2*x))^th3, 
#             data = data, start = start, trace = TRUE)

Models comparison

models = list("Exponential model" = mod1, 
              "Logistic model" = mod2, 
              "Gompertz model" = mod3,
              "Richards model" = mod4)
tab = data.frame(loglik = sapply(models, logLik),
                 df = sapply(models, function(m) attr(logLik(m), "df")),
                 Rsquare = sapply(models, function(m) 
                                  cor(data$y, fitted(m))^2),
                 AIC = sapply(models, AIC),
                 AICc = sapply(models, AICc),
                 BIC = sapply(models, BIC))
sel <- apply(tab[,4:6], 2, which.min)
tab$"" <- sapply(tabulate(sel, nbins = length(models))+1, symnum,
                 cutpoints = 0:4, symbols = c("", "*", "**", "***"))
knitr::kable(tab)
loglik df Rsquare AIC AICc BIC
Exponential model -884.7793 3 0.8883066 1775.559 1775.892 1782.551
Logistic model -765.3666 4 0.9954928 1538.733 1539.297 1548.056
Gompertz model -682.5534 4 0.9994153 1373.107 1373.670 1382.430
Richards model -660.7729 4 0.9996644 1329.546 1330.109 1338.869 ***
ggplot(data, aes(x = date, y = y)) + 
  geom_point() +
  geom_line(aes(y = fitted(mod1), color = "Exponential")) +
  geom_line(aes(y = fitted(mod2), color = "Logistic")) +
  geom_line(aes(y = fitted(mod3), color = "Gompertz")) +
  geom_line(aes(y = fitted(mod4), color = "Richards")) +
  labs(x = "", y = "Infected", color = "Model") +
  scale_color_manual(values = cols) +
  scale_y_continuous(breaks = seq(0, coef(mod2)[1], by = 10000),
                     minor_breaks = seq(0, coef(mod2)[1], by = 5000)) +
  scale_x_date(date_breaks = "2 day", date_labels =  "%b%d",
               minor_breaks = "1 day") +
  theme_bw() +
  theme(legend.position = "top",
        axis.text.x = element_text(angle=60, hjust=1))

last_plot() +
  scale_y_continuous(trans = "log10", limits = c(100,NA)) +
  labs(y = "Infected (log10 scale)")

Predictions

Point estimates

df = data.frame(x = seq(min(data$x), max(data$x)+14))
df = cbind(df, date = as.Date(df$x, origin = data$date[1]-1),
               fit1 = predict(mod1, newdata = df),
               fit2 = predict(mod2, newdata = df),
               fit3 = predict(mod3, newdata = df),
               fit4 = predict(mod4, newdata = df))
ylim = c(0, max(df[,c("fit2", "fit3")]))
ggplot(data, aes(x = date, y = y)) + 
  geom_point() +
  geom_line(data = df, aes(x = date, y = fit1, color = "Exponential")) +
  geom_line(data = df, aes(x = date, y = fit2, color = "Logistic")) +
  geom_line(data = df, aes(x = date, y = fit3, color = "Gompertz")) +
  geom_line(data = df, aes(x = date, y = fit4, color = "Richards")) +
  coord_cartesian(ylim = ylim) +
  labs(x = "", y = "Infected", color = "Model") +
  scale_y_continuous(breaks = seq(0, max(ylim), by = 10000),
                     minor_breaks = seq(0, max(ylim), by = 5000)) +
  scale_x_date(date_breaks = "2 day", date_labels =  "%b%d",
               minor_breaks = "1 day") +
  scale_color_manual(values = cols) +
  theme_bw() +
  theme(legend.position = "top",
        axis.text.x = element_text(angle=60, hjust=1))

Prediction intervals

# compute prediction using Moving Block Bootstrap (MBB) for nls
df = data.frame(x = seq(min(data$x), max(data$x)+14))
df = cbind(df, date = as.Date(df$x, origin = data$date[1]-1))

pred1 = cbind(df, "fit" = predict(mod1, newdata = df))
pred1[df$x > max(data$x), c("lwr", "upr")] = predictMBB.nls(mod1, df[df$x > max(data$x),])[,2:3]

pred2 = cbind(df, "fit" = predict(mod2, newdata = df))
pred2[df$x > max(data$x), c("lwr", "upr")] = predictMBB.nls(mod2, df[df$x > max(data$x),])[,2:3]

pred3 = cbind(df, "fit" = predict(mod3, newdata = df))
pred3[df$x > max(data$x), c("lwr", "upr")] = predictMBB.nls(mod3, df[df$x > max(data$x),])[,2:3]

pred4 = cbind(df, "fit" = predict(mod4, newdata = df))
pred4[df$x > max(data$x), c("lwr", "upr")] = predictMBB.nls(mod4, df[df$x > max(data$x),])[,2:3]

# predictions for next day
pred = rbind(subset(pred1, x == max(data$x)+1, select = 2:5),
             subset(pred2, x == max(data$x)+1, select = 2:5),
             subset(pred3, x == max(data$x)+1, select = 2:5),
             subset(pred4, x == max(data$x)+1, select = 2:5))
print(pred, digits = 3)
##           date    fit    lwr    upr
## 77  2020-05-10 276712 212865 345145
## 771 2020-05-10 210366 197408 221851
## 772 2020-05-10 217292 212633 221822
## 773 2020-05-10 218722 215074 222140

ylim = c(0, max(pred2$upr, pred3$upr, na.rm=TRUE))
ggplot(data, aes(x = date, y = y)) + 
  geom_point() +
  geom_line(data = pred1, aes(x = date, y = fit, color = "Exponential")) +
  geom_line(data = pred2, aes(x = date, y = fit, color = "Logistic")) +
  geom_line(data = pred3, aes(x = date, y = fit, color = "Gompertz")) +
  geom_line(data = pred4, aes(x = date, y = fit, color = "Richards")) +
  geom_ribbon(data = pred1, aes(x = date, ymin = lwr, ymax = upr), 
              inherit.aes = FALSE, fill = cols[1], alpha=0.3) +
  geom_ribbon(data = pred2, aes(x = date, ymin = lwr, ymax = upr), 
              inherit.aes = FALSE, fill = cols[2], alpha=0.3) +
  geom_ribbon(data = pred3, aes(x = date, ymin = lwr, ymax = upr),
              inherit.aes = FALSE, fill = cols[3], alpha=0.3) +
  geom_ribbon(data = pred4, aes(x = date, ymin = lwr, ymax = upr),
              inherit.aes = FALSE, fill = cols[4], alpha=0.3) +
  coord_cartesian(ylim = c(0, max(ylim))) +
  labs(x = "", y = "Infected", color = "Model") +
  scale_y_continuous(minor_breaks = seq(0, max(ylim), by = 10000)) +
  scale_x_date(date_breaks = "2 day", date_labels =  "%b%d",
               minor_breaks = "1 day") +
  scale_color_manual(values = cols) +
  theme_bw() +
  theme(legend.position = "top",
        axis.text.x = element_text(angle=60, hjust=1))

Modelling total deceased

# create data for analysis
data = data.frame(date = COVID19$data,
                  y = COVID19$deceduti,
                                    dy = reldiff(COVID19$deceduti))
data$x = as.numeric(data$date) - min(as.numeric(data$date)) + 1
DT::datatable(data, options = list("pageLength" = 5))

Estimation

Exponential

mod1_start = lm(log(y) ~ x, data = data)
b = unname(coef(mod1_start))
start = list(th1 = exp(b[1]), th2 = b[2])
exponential <- function(x, th1, th2) th1 * exp(th2 * x)
mod1 = nls(y ~ exponential(x, th1, th2), data = data, start = start)
summary(mod1)
## 
## Formula: y ~ exponential(x, th1, th2)
## 
## Parameters:
##        Estimate  Std. Error t value          Pr(>|t|)    
## th1 3227.344471  360.681690   8.948 0.000000000000209 ***
## th2    0.032187    0.001771  18.178           < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3697 on 74 degrees of freedom
## 
## Number of iterations to convergence: 13 
## Achieved convergence tolerance: 0.000005387

Logistic

mod2 = nls(y ~ SSlogis(x, Asym, xmid, scal), data = data)
summary(mod2)
## 
## Formula: y ~ SSlogis(x, Asym, xmid, scal)
## 
## Parameters:
##        Estimate Std. Error t value Pr(>|t|)    
## Asym 29720.5451   311.0309   95.56   <2e-16 ***
## xmid    41.6186     0.3406  122.18   <2e-16 ***
## scal     9.1724     0.2547   36.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 794.9 on 73 degrees of freedom
## 
## Number of iterations to convergence: 0 
## Achieved convergence tolerance: 0.000001991

Gompertz

mod3 = nls(y ~ SSgompertz(x, Asym, b2, b3), data = data)
# manually set starting values
# start = list(Asym = coef(mod2)[1])
# tmp = list(y = log(log(start$Asym) - log(data$y)), x = data$x)
# b = unname(coef(lm(y ~ x, data = tmp)))
# start = c(start, c(b2 = exp(b[1]), b3 = exp(b[2])))
# mod3 = nls(y ~ SSgompertz(x, Asym, b2, b3), data = data, start = start, 
#            control = nls.control(maxiter = 10000))
summary(mod3)
## 
## Formula: y ~ SSgompertz(x, Asym, b2, b3)
## 
## Parameters:
##           Estimate    Std. Error t value Pr(>|t|)    
## Asym 32741.3895198   185.7802097  176.24   <2e-16 ***
## b2      10.1613720     0.2365107   42.96   <2e-16 ***
## b3       0.9396022     0.0007131 1317.70   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 268.2 on 73 degrees of freedom
## 
## Number of iterations to convergence: 0 
## Achieved convergence tolerance: 0.000002826

Richards

richards <- function(x, th1, th2, th3) th1*(1 - exp(-th2*x))^th3
Loss  <- function(th, y, x) sum((y - richards(x, th[1], th[2], th[3]))^2) 
start <- optim(par = c(coef(mod2)[1], 0.001, 1), fn = Loss, 
               y = data$y, x = data$x)$par
names(start) <- c("th1", "th2", "th3")
mod4 = nls(y ~ richards(x, th1, th2, th3), data = data, start = start,
           # trace = TRUE, algorithm = "port", 
           control = nls.control(maxiter = 1000))
summary(mod4)
## 
## Formula: y ~ richards(x, th1, th2, th3)
## 
## Parameters:
##         Estimate   Std. Error t value Pr(>|t|)    
## th1 33550.100385   181.163955  185.19   <2e-16 ***
## th2     0.056199     0.000704   79.82   <2e-16 ***
## th3     7.711157     0.179097   43.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 210.4 on 73 degrees of freedom
## 
## Number of iterations to convergence: 4 
## Achieved convergence tolerance: 0.000001877

Models comparison

models = list("Exponential model" = mod1, 
              "Logistic model" = mod2, 
              "Gompertz model" = mod3,
              "Richards model" = mod4)
tab = data.frame(loglik = sapply(models, logLik),
                 df = sapply(models, function(m) attr(logLik(m), "df")),
                 Rsquare = sapply(models, function(m) 
                                  cor(data$y, fitted(m))^2),
                 AIC = sapply(models, AIC),
                 AICc = sapply(models, AICc),
                 BIC = sapply(models, BIC))
sel <- apply(tab[,4:6], 2, which.min)
tab$"" <- sapply(tabulate(sel, nbins = length(models))+1, symnum,
                 cutpoints = 0:4, symbols = c("", "*", "**", "***"))
knitr::kable(tab)
loglik df Rsquare AIC AICc BIC
Exponential model -731.1790 3 0.9019452 1468.358 1468.691 1475.350
Logistic model -613.8494 4 0.9957906 1235.699 1236.262 1245.022
Gompertz model -531.2757 4 0.9994486 1070.551 1071.115 1079.874
Richards model -512.8444 4 0.9996524 1033.689 1034.252 1043.012 ***
ggplot(data, aes(x = date, y = y)) + 
  geom_point() +
  geom_line(aes(y = fitted(mod1), color = "Exponential")) +
  geom_line(aes(y = fitted(mod2), color = "Logistic")) +
  geom_line(aes(y = fitted(mod3), color = "Gompertz")) +
  geom_line(aes(y = fitted(mod4), color = "Richards")) +
  labs(x = "", y = "Deceased", color = "Model") +
  scale_color_manual(values = cols) +
  scale_y_continuous(breaks = seq(0, coef(mod2)[1], by = 1000),
                     minor_breaks = seq(0, coef(mod2)[1], by = 500)) +
  scale_x_date(date_breaks = "2 day", date_labels =  "%b%d",
               minor_breaks = "1 day") +
  theme_bw() +
  theme(legend.position = "top",
        axis.text.x = element_text(angle=60, hjust=1))

last_plot() +
  scale_y_continuous(trans = "log10", limits = c(10,NA)) +
  labs(y = "Deceased (log10 scale)")

Predictions

Point estimates

df = data.frame(x = seq(min(data$x), max(data$x)+14))
df = cbind(df, date = as.Date(df$x, origin = data$date[1]-1),
               fit1 = predict(mod1, newdata = df),
               fit2 = predict(mod2, newdata = df),
               fit3 = predict(mod3, newdata = df),
               fit4 = predict(mod4, newdata = df))
ylim = c(0, max(df[,-(1:3)]))
ggplot(data, aes(x = date, y = y)) + 
  geom_point() +
  geom_line(data = df, aes(x = date, y = fit1, color = "Exponential")) +
  geom_line(data = df, aes(x = date, y = fit2, color = "Logistic")) +
  geom_line(data = df, aes(x = date, y = fit3, color = "Gompertz")) +
  geom_line(data = df, aes(x = date, y = fit4, color = "Richards")) +
  coord_cartesian(ylim = ylim) +
  labs(x = "", y = "Deceased", color = "Model") +
  scale_y_continuous(breaks = seq(0, max(ylim), by = 1000),
                     minor_breaks = seq(0, max(ylim), by = 1000)) +
  scale_x_date(date_breaks = "2 day", date_labels =  "%b%d",
               minor_breaks = "1 day") +
  scale_color_manual(values = cols) +
  theme_bw() +
  theme(legend.position = "top",
        axis.text.x = element_text(angle=60, hjust=1))

Prediction intervals

# compute prediction using Moving Block Bootstrap (MBB) for nls
df = data.frame(x = seq(min(data$x), max(data$x)+14))
df = cbind(df, date = as.Date(df$x, origin = data$date[1]-1))

pred1 = cbind(df, "fit" = predict(mod1, newdata = df))
pred1[df$x > max(data$x), c("lwr", "upr")] = predictMBB.nls(mod1, df[df$x > max(data$x),])[,2:3]

pred2 = cbind(df, "fit" = predict(mod2, newdata = df))
pred2[df$x > max(data$x), c("lwr", "upr")] = predictMBB.nls(mod2, df[df$x > max(data$x),])[,2:3]

pred3 = cbind(df, "fit" = predict(mod3, newdata = df))
pred3[df$x > max(data$x), c("lwr", "upr")] = predictMBB.nls(mod3, df[df$x > max(data$x),])[,2:3]

pred4 = cbind(df, "fit" = predict(mod4, newdata = df))
pred4[df$x > max(data$x), c("lwr", "upr")] = predictMBB.nls(mod4, df[df$x > max(data$x),])[,2:3]

# predictions for next day
pred = rbind(subset(pred1, x == max(data$x)+1, select = 2:5),
             subset(pred2, x == max(data$x)+1, select = 2:5),
             subset(pred3, x == max(data$x)+1, select = 2:5),
             subset(pred4, x == max(data$x)+1, select = 2:5))
print(pred, digits = 3)
##           date   fit   lwr   upr
## 77  2020-05-10 38477 29043 48369
## 771 2020-05-10 29106 27251 30658
## 772 2020-05-10 30107 29487 30697
## 773 2020-05-10 30282 29772 30762

ylim = c(0, max(pred2$upr, pred3$upr, na.rm=TRUE))
ggplot(data, aes(x = date, y = y)) + 
  geom_point() +
  geom_line(data = pred1, aes(x = date, y = fit, color = "Exponential")) +
  geom_line(data = pred2, aes(x = date, y = fit, color = "Logistic")) +
  geom_line(data = pred3, aes(x = date, y = fit, color = "Gompertz")) +
  geom_line(data = pred4, aes(x = date, y = fit, color = "Richards")) +
  geom_ribbon(data = pred1, aes(x = date, ymin = lwr, ymax = upr), 
              inherit.aes = FALSE, fill = cols[1], alpha=0.3) +
  geom_ribbon(data = pred2, aes(x = date, ymin = lwr, ymax = upr), 
              inherit.aes = FALSE, fill = cols[2], alpha=0.3) +
  geom_ribbon(data = pred3, aes(x = date, ymin = lwr, ymax = upr),
              inherit.aes = FALSE, fill = cols[3], alpha=0.3) +
  geom_ribbon(data = pred4, aes(x = date, ymin = lwr, ymax = upr),
              inherit.aes = FALSE, fill = cols[4], alpha=0.3) +
  coord_cartesian(ylim = c(0, max(ylim))) +
  labs(x = "", y = "Deceased", color = "Model") +
  scale_y_continuous(minor_breaks = seq(0, max(ylim), by = 1000)) +
  scale_x_date(date_breaks = "2 day", date_labels =  "%b%d",
               minor_breaks = "1 day") +
  scale_color_manual(values = cols) +
  theme_bw() +
  theme(legend.position = "top",
        axis.text.x = element_text(angle=60, hjust=1))

Modelling recovered

# create data for analysis
data = data.frame(date = COVID19$data,
                  y = COVID19$dimessi_guariti,
                                    dy = reldiff(COVID19$dimessi_guariti))
data$x = as.numeric(data$date) - min(as.numeric(data$date)) + 1
DT::datatable(data, options = list("pageLength" = 5))

Estimation

Exponential

mod1_start = lm(log(y) ~ x, data = data)
b = unname(coef(mod1_start))
start = list(th1 = exp(b[1]), th2 = b[2])
exponential <- function(x, th1, th2) th1 * exp(th2 * x)
mod1 = nls(y ~ exponential(x, th1, th2), data = data, start = start)
summary(mod1)
## 
## Formula: y ~ exponential(x, th1, th2)
## 
## Parameters:
##        Estimate  Std. Error t value Pr(>|t|)    
## th1 2957.800379  221.992643   13.32   <2e-16 ***
## th2    0.047651    0.001123   42.44   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4236 on 74 degrees of freedom
## 
## Number of iterations to convergence: 12 
## Achieved convergence tolerance: 0.000001772

Logistic

mod2 = nls(y ~ SSlogis(x, Asym, xmid, scal), data = data)
summary(mod2)
## 
## Formula: y ~ SSlogis(x, Asym, xmid, scal)
## 
## Parameters:
##         Estimate  Std. Error t value Pr(>|t|)    
## Asym 138548.3407   4762.3218   29.09   <2e-16 ***
## xmid     64.4445      0.9692   66.49   <2e-16 ***
## scal     12.8647      0.3231   39.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1717 on 73 degrees of freedom
## 
## Number of iterations to convergence: 0 
## Achieved convergence tolerance: 0.000001305

Gompertz

mod3 = nls(y ~ SSgompertz(x, Asym, b2, b3), data = data)
summary(mod3)
## 
## Formula: y ~ SSgompertz(x, Asym, b2, b3)
## 
## Parameters:
##            Estimate     Std. Error t value Pr(>|t|)    
## Asym 314160.4426424  19474.0383548   16.13   <2e-16 ***
## b2        7.7079292      0.1194179   64.55   <2e-16 ***
## b3        0.9751808      0.0008332 1170.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1011 on 73 degrees of freedom
## 
## Number of iterations to convergence: 0 
## Achieved convergence tolerance: 0.000003518

Richards

richards <- function(x, th1, th2, th3) th1*(1 - exp(-th2*x))^th3
Loss  <- function(th, y, x) sum((y - richards(x, th[1], th[2], th[3]))^2) 
start <- optim(par = c(coef(mod2)[1], 0.001, 1), fn = Loss, 
               y = data$y, x = data$x)$par
names(start) <- c("th1", "th2", "th3")
mod4 = nls(y ~ richards(x, th1, th2, th3), data = data, start = start,
           # trace = TRUE, # algorithm = "port", 
           control = nls.control(maxiter = 1000))
summary(mod4)
## 
## Formula: y ~ richards(x, th1, th2, th3)
## 
## Parameters:
##           Estimate     Std. Error t value       Pr(>|t|)    
## th1 1078525.982837  260899.630834   4.134 0.000094092976 ***
## th2       0.009013       0.001271   7.090 0.000000000709 ***
## th3       3.374017       0.133252  25.320        < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 854 on 73 degrees of freedom
## 
## Number of iterations to convergence: 33 
## Achieved convergence tolerance: 0.000002631

Models comparison

models = list("Exponential model" = mod1, 
              "Logistic model" = mod2, 
              "Gompertz model" = mod3,
              "Richards model" = mod4)
tab = data.frame(loglik = sapply(models, logLik),
                 df = sapply(models, function(m) attr(logLik(m), "df")),
                 Rsquare = sapply(models, function(m) 
                                  cor(data$y, fitted(m))^2),
                 AIC = sapply(models, AIC),
                 AICc = sapply(models, AICc),
                 BIC = sapply(models, BIC))
sel <- apply(tab[,4:6], 2, which.min)
tab$"" <- sapply(tabulate(sel, nbins = length(models))+1, symnum,
                 cutpoints = 0:4, symbols = c("", "*", "**", "***"))
knitr::kable(tab)
loglik df Rsquare AIC AICc BIC
Exponential model -741.5378 3 0.9852509 1489.076 1489.409 1496.068
Logistic model -672.3783 4 0.9973955 1352.757 1353.320 1362.080
Gompertz model -632.1505 4 0.9989961 1272.301 1272.864 1281.624
Richards model -619.3014 4 0.9992579 1246.603 1247.166 1255.926 ***
ggplot(data, aes(x = date, y = y)) + 
  geom_point() +
  geom_line(aes(y = fitted(mod1), color = "Exponential")) +
  geom_line(aes(y = fitted(mod2), color = "Logistic")) +
  geom_line(aes(y = fitted(mod3), color = "Gompertz")) +
  geom_line(aes(y = fitted(mod4), color = "Richards")) +
  labs(x = "", y = "Recovered", color = "Model") +
  scale_color_manual(values = cols) +
  scale_y_continuous(breaks = seq(0, coef(mod2)[1], by = 1000),
                     minor_breaks = seq(0, coef(mod2)[1], by = 500)) +
  scale_x_date(date_breaks = "2 day", date_labels =  "%b%d",
               minor_breaks = "1 day") +
  theme_bw() +
  theme(legend.position = "top",
        axis.text.x = element_text(angle=60, hjust=1))

last_plot() +
  scale_y_continuous(trans = "log10", limits = c(10,NA)) +
  labs(y = "Recovered (log10 scale)")

Predictions

Point estimates

df = data.frame(x = seq(min(data$x), max(data$x)+14))
df = cbind(df, date = as.Date(df$x, origin = data$date[1]-1),
               fit1 = predict(mod1, newdata = df),
               fit2 = predict(mod2, newdata = df),
               fit3 = predict(mod3, newdata = df),
               fit4 = predict(mod4, newdata = df))
ylim = c(0, max(df[,-(1:3)]))
ggplot(data, aes(x = date, y = y)) + 
  geom_point() + 
  geom_line(data = df, aes(x = date, y = fit1, color = "Exponential")) +
  geom_line(data = df, aes(x = date, y = fit2, color = "Logistic")) +
  geom_line(data = df, aes(x = date, y = fit3, color = "Gompertz")) +
  geom_line(data = df, aes(x = date, y = fit4, color = "Richards")) +
  coord_cartesian(ylim = ylim) +
  labs(x = "", y = "Recovered", color = "Model") +
  scale_y_continuous(breaks = seq(0, max(ylim), by = 1000),
                     minor_breaks = seq(0, max(ylim), by = 1000)) +
  scale_x_date(date_breaks = "2 day", date_labels =  "%b%d",
               minor_breaks = "1 day") +
  scale_color_manual(values = cols) +
  theme_bw() +
  theme(legend.position = "top",
        axis.text.x = element_text(angle=60, hjust=1))

Prediction intervals

# compute prediction using Moving Block Bootstrap (MBB) for nls
df = data.frame(x = seq(min(data$x), max(data$x)+14))
df = cbind(df, date = as.Date(df$x, origin = data$date[1]-1))

pred1 = cbind(df, "fit" = predict(mod1, newdata = df))
pred1[df$x > max(data$x), c("lwr", "upr")] = predictMBB.nls(mod1, df[df$x > max(data$x),])[,2:3]

pred2 = cbind(df, "fit" = predict(mod2, newdata = df))
pred2[df$x > max(data$x), c("lwr", "upr")] = predictMBB.nls(mod2, df[df$x > max(data$x),])[,2:3]

pred3 = cbind(df, "fit" = predict(mod3, newdata = df))
pred3[df$x > max(data$x), c("lwr", "upr")] = predictMBB.nls(mod3, df[df$x > max(data$x),])[,2:3]

pred4 = cbind(df, "fit" = predict(mod4, newdata = df))
pred4[df$x > max(data$x), c("lwr", "upr")] = predictMBB.nls(mod4, df[df$x > max(data$x),])[,2:3]

# predictions for next day
pred = rbind(subset(pred1, x == max(data$x)+1, select = 2:5),
             subset(pred2, x == max(data$x)+1, select = 2:5),
             subset(pred3, x == max(data$x)+1, select = 2:5),
             subset(pred4, x == max(data$x)+1, select = 2:5))
print(pred, digits = 3)
##           date    fit    lwr    upr
## 77  2020-05-10 116000 103887 127614
## 771 2020-05-10 100629  95830 104350
## 772 2020-05-10 103225 100775 105234
## 773 2020-05-10 104323 102731 106091

ylim = c(0, max(pred2$upr, pred3$upr, na.rm=TRUE))
ggplot(data, aes(x = date, y = y)) + 
  geom_point() +
  geom_line(data = pred1, aes(x = date, y = fit, color = "Exponential")) +
  geom_line(data = pred2, aes(x = date, y = fit, color = "Logistic")) +
  geom_line(data = pred3, aes(x = date, y = fit, color = "Gompertz")) +
  geom_line(data = pred4, aes(x = date, y = fit, color = "Richards")) +
  geom_ribbon(data = pred1, aes(x = date, ymin = lwr, ymax = upr), 
              inherit.aes = FALSE, fill = cols[1], alpha=0.3) +
  geom_ribbon(data = pred2, aes(x = date, ymin = lwr, ymax = upr), 
              inherit.aes = FALSE, fill = cols[2], alpha=0.3) +
  geom_ribbon(data = pred3, aes(x = date, ymin = lwr, ymax = upr),
              inherit.aes = FALSE, fill = cols[3], alpha=0.3) +
  geom_ribbon(data = pred4, aes(x = date, ymin = lwr, ymax = upr),
              inherit.aes = FALSE, fill = cols[4], alpha=0.3) +
  coord_cartesian(ylim = c(0, max(ylim))) +
  labs(x = "", y = "Recovered", color = "Model") +
  scale_y_continuous(breaks = seq(0, max(ylim), by = 5000),
                     minor_breaks = seq(0, max(ylim), by = 1000)) +
  scale_x_date(date_breaks = "2 day", date_labels =  "%b%d",
               minor_breaks = "1 day") +
  scale_color_manual(values = cols) +
  theme_bw() +
  theme(legend.position = "top",
        axis.text.x = element_text(angle=60, hjust=1))

Description of evolution

Positive cases and administered swabs

df = data.frame(date = COVID19$data,
                positives = c(NA, diff(COVID19$totale_casi)),
                swabs = c(NA, diff(COVID19$tamponi)))
df$x = as.numeric(df$date) - min(as.numeric(df$date)) + 1
# df$y = df$positives/df$swabs
df$y = df$positives/c(NA, zoo::rollmean(df$swabs, 2))
df = subset(df, swabs > 50)
# DT::datatable(df[,-4], )
ggplot(df, aes(x = date)) + 
  geom_point(aes(y = swabs, color = "swabs"), pch = 19) +
  geom_line(aes(y = swabs, color = "swabs")) +
  geom_point(aes(y = positives, color = "positives"), pch = 0) +
  geom_line(aes(y = positives, color = "positives")) +
  labs(x = "", y = "Number of cases", color = "") +
  scale_x_date(date_breaks = "2 day", date_labels =  "%b%d",
               minor_breaks = "1 day") +
  scale_color_manual(values = palette()[c(2,1)]) +
  theme_bw() +
  theme(legend.position = "top",
        axis.text.x = element_text(angle=60, hjust=1))

ggplot(df, aes(x = date, y = y)) + 
  geom_smooth(method = "loess", se = TRUE, col = "black") +
  geom_point(col=palette()[4]) + 
  geom_line(size = 0.5, col=palette()[4]) +
  labs(x = "", y = "% positives among admnistered swabs (two-day rolling mean)") +
  scale_y_continuous(labels = scales::percent_format(),
                     breaks = seq(0, 0.5, by = 0.05)) +
  coord_cartesian(ylim = c(0,max(df$y, na.rm = TRUE))) +
  scale_x_date(date_breaks = "2 day", date_labels =  "%b%d",
               minor_breaks = "1 day") +
  theme_bw() +
  theme(legend.position = "top",
        axis.text.x = element_text(angle=60, hjust=1))

Hospitalized and ICU patients

df = data.frame(date = COVID19$data,
                hospital = c(NA, diff(COVID19$totale_ospedalizzati)),
                icu = c(NA, diff(COVID19$terapia_intensiva)))
df$x = as.numeric(df$date) - min(as.numeric(df$date)) + 1
ggplot(df, aes(x = date, y = hospital)) + 
  geom_smooth(method = "loess", se = TRUE, col = "black") +
  geom_point(col = "orange") + 
  geom_line(size = 0.5, col = "orange") +
  labs(x = "", y = "Change hospitalized patients") +
  coord_cartesian(ylim = range(df$hospital, na.rm = TRUE)) +
  scale_y_continuous(minor_breaks = seq(min(df$hospital, na.rm = TRUE),
                                        max(df$hospital, na.rm = TRUE), 
                                        by = 100)) +
  scale_x_date(date_breaks = "2 day", date_labels =  "%b%d",
               minor_breaks = "1 day") +
  theme_bw() +
  theme(legend.position = "top",
        axis.text.x = element_text(angle=60, hjust=1))

ggplot(df, aes(x = date, y = icu)) + 
  geom_smooth(method = "loess", se = TRUE, col = "black") +
  geom_point(col = "red2") + 
  geom_line(size = 0.5, col = "red2") +
  labs(x = "", y = "Change ICU patients") +
  coord_cartesian(ylim = range(df$icu, na.rm = TRUE)) +
  scale_y_continuous(minor_breaks = seq(min(df$icu, na.rm = TRUE), 
                                        max(df$icu, na.rm = TRUE), 
                                        by = 10)) +
  scale_x_date(date_breaks = "2 day", date_labels =  "%b%d",
               minor_breaks = "1 day") +
  theme_bw() +
  theme(legend.position = "top",
        axis.text.x = element_text(angle=60, hjust=1))